Propose a candidate state \(x_*\) by sampling from some proposal distribution \(J\left (x_* | x_t \right )\)
Accept proposal with probability \(A = \min \left (1, \frac{f(x_*)}{f(x_t)} \times \frac{J(x_t | x_*)}{J(x_* | x_t)} \right )\)
Note: if \(J\) is symmetric (\(J\left (x_* | x_t \right ) = J\left (x_t | x_* \right )\)), the bias correction term \(\frac{J(x_t | x_*)}{J(x_* | x_t)}\) cancels, simplifying to the Metropolis Algorithm
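The steps above can be sketched in R with a symmetric normal proposal (so the correction term cancels, i.e. the Metropolis algorithm); the standard-normal target `f`, proposal width, and iteration count are illustrative choices, not part of the original notes:

```r
set.seed(42)
f <- function(x) exp(-x^2 / 2)        # unnormalized target: standard normal

metropolis <- function(n_iter, x0 = 0, width = 1) {
  x <- numeric(n_iter)
  x[1] <- x0
  for (t in 2:n_iter) {
    x_star <- rnorm(1, mean = x[t - 1], sd = width)   # symmetric proposal
    A <- min(1, f(x_star) / f(x[t - 1]))              # correction term cancels
    x[t] <- if (runif(1) < A) x_star else x[t - 1]
  }
  x
}

chain <- metropolis(1e4)
c(mean(chain), sd(chain))             # should be near 0 and 1
```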
Proposal Width: how far away from current value we’re willing to go
Burn-in: how long the chain gets to reach stationary distribution
Thinning: keep only every \(k\)-th sample to reduce autocorrelation
Note: Why do you think the trace plot looks like this?
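A minimal sketch of applying burn-in and thinning to a vector of draws; the cutoff of 1,000 and keep-every-5th interval are arbitrary choices:

```r
chain <- rnorm(1e4)                  # stand-in for raw MCMC draws
burned <- chain[-(1:1000)]           # burn-in: drop the first 1,000 draws
thinned <- burned[seq(1, length(burned), by = 5)]   # thinning: keep every 5th
length(thinned)                      # 1800 draws remain
```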
Multidimensional (multiple parameters) samples
\(p \left (x_1, x_2, ... x_n \right)\) is tough to sample from, but each full conditional \(p \left (x_i | \text{all other } x \right)\) is not
Pros: special case of MH where proposal is always accepted
Cons: need every full conditional, e.g. \(p \left ( x_1 | x_2,...,x_n \right)\), in a form we can sample from
Start with \(\mathbf{x^{(0)}} = \left ( x_{1}^{(0)}, x_{2}^{(0)},...x_{n}^{(0)} \right)\)
Sample \(x_1^{(1)} \sim p \left ( x_1^{(1)} | x_2^{(0)},...,x_n^{(0)} \right)\)
Sample \(x_2^{(1)} \sim p \left ( x_2^{(1)} | x_1^{(1)},...,x_n^{(0)} \right)\)
Repeat for all \(x_i \in \mathbf{x}\)
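The steps above can be sketched for a bivariate normal with correlation \(\rho = 0.8\), whose full conditionals are \(x_1 | x_2 \sim N(\rho x_2, 1 - \rho^2)\); the correlation and iteration count are illustrative choices:

```r
set.seed(1)
rho <- 0.8                        # target: bivariate normal, correlation 0.8
n_iter <- 1e4
x <- matrix(0, n_iter, 2)
for (t in 2:n_iter) {
  # each coordinate drawn from its full conditional, using the newest values
  x[t, 1] <- rnorm(1, rho * x[t - 1, 2], sqrt(1 - rho^2))
  x[t, 2] <- rnorm(1, rho * x[t, 1],     sqrt(1 - rho^2))
}
cor(x[, 1], x[, 2])               # should be near 0.8
```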
Problem: MH and Gibbs sampling work well, but their proposals can be inefficient or difficult to obtain
Solution: use information about shape of target distribution to generate better proposals
\[ \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
generate random momentum vector (a “flick” of the friction-less particle)
use a physics simulation that takes current position \(x_0\) and momentum \(w_0\) and generates a new state \(\left ( x_T, w_T \right)\)
accept with \(A = \min \left (1, \frac{\exp(-H(x_T, w_T))}{\exp(-H(x_0, w_0))} \right )\)
Note: HMC tends to produce states with similar energy (energy should be totally conserved, theoretically), so acceptance rate is high
\[ H(x,w) = \underbrace{U(x)}_\text{potential energy} + \underbrace{K(w)}_\text{kinetic energy} \]
\(U(x)\): potential energy (potential energy is high when you’re on top of a “hill”)
\(K(w)\): kinetic energy (kinetic energy is high when you’re moving fast)
Hyperparameters:
Leapfrog Step Size
Leapfrog Step #
\[ A = \min \left (1, \frac{\exp(-H(x_T, w_T))}{\exp(-H(x_0, w_0))} \right ) \]
when energy \(H(x,w)\) stays the same or decreases, acceptance prob is high
when energy \(H(x,w)\) increases, acceptance prob is low because the proposed sample is unlikely
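The full HMC loop can be sketched for a standard-normal target, where \(U(x) = x^2/2\) and a leapfrog integrator plays the role of the physics simulation; the step size, step count, and chain length here are illustrative choices:

```r
set.seed(7)
U      <- function(x) x^2 / 2            # potential: -log target (std normal)
grad_U <- function(x) x
H      <- function(x, w) U(x) + w^2 / 2  # Hamiltonian: potential + kinetic

hmc_step <- function(x0, eps = 0.1, L = 20) {
  w0 <- rnorm(1)                         # random momentum "flick"
  x <- x0
  w <- w0 - eps * grad_U(x) / 2          # leapfrog: initial half step for momentum
  for (l in 1:L) {
    x <- x + eps * w                     # full step for position
    w <- w - eps * grad_U(x) * (if (l < L) 1 else 0.5)  # full (or final half) step
  }
  A <- min(1, exp(H(x0, w0) - H(x, w)))  # accept with exp(-H_new) / exp(-H_old)
  if (runif(1) < A) x else x0
}

chain <- numeric(5000)
for (t in 2:5000) chain[t] <- hmc_step(chain[t - 1])
c(mean(chain), sd(chain))                # should be near 0 and 1
```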
show curve and how discretized leapfrog steps cause divergences when the step size is too large
Effective Sample Size (ESS)
Trace plots/\(\hat{R}\) (r-hat)
(HMC) Divergent Transitions
Question: have our chains converged to the target distribution?
Answer:
start chain from different values
compare samples from chains
should look like a fuzzy caterpillar (not kidding)
\(\hat{R}\) is a metric that summarizes similar information as a trace plot
\[ \hat{R} = \frac{\hat{V}}{W} \]
where \(\hat{V}\) is the pooled variance estimate across all of the chains (combining between-chain and within-chain variation), and \(W\) is the within-chain variance.
Rule of Thumb: \(\hat{R} < 1.1\)
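A sketch of the \(\hat{R}\) computation on four synthetic, well-mixed chains; the chain count and length are arbitrary choices:

```r
set.seed(3)
chains <- replicate(4, rnorm(1000))   # 4 well-mixed chains, one per column
n <- nrow(chains)
W <- mean(apply(chains, 2, var))      # within-chain variance
B <- n * var(colMeans(chains))        # between-chain variance
V_hat <- (n - 1) / n * W + B / n      # pooled variance estimate
R_hat <- V_hat / W                    # near 1 at convergence
R_hat                                 # (some references report the square root of this ratio)
```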
\[ N_{eff} = \frac{N}{1 + 2\sum_{t=1}^{\infty} \rho_t} \]
How many independent random samples would we need to get the same amount of information as our MCMC samples give us
Measures efficiency of samples, but there’s nothing inherently wrong with low ESS.
Rule of Thumb: ESS > no. chains * 100
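\(N_{eff}\) can be estimated by plugging empirical autocorrelations \(\rho_t\) into the formula above; a sketch on a synthetic AR(1) chain, where the AR coefficient and lag cutoff are illustrative choices:

```r
set.seed(5)
# AR(1) series with coefficient 0.9 stands in for autocorrelated MCMC draws
chain <- as.numeric(arima.sim(list(ar = 0.9), n = 1e4))
rho <- acf(chain, lag.max = 200, plot = FALSE)$acf[-1]   # rho_1, rho_2, ...
cut <- which(rho < 0)[1]          # truncate the sum at the first negative lag
if (!is.na(cut)) rho <- rho[1:cut]
n_eff <- length(chain) / (1 + 2 * sum(rho))
n_eff   # far below 1e4: autocorrelated draws carry less information
```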
In HMC, we reject a proposal when the simulated trajectory diverges from the true trajectory (measured by \(H(x_T, w_T) - H(x_0, w_0)\))
Lots of divergences often means biased samples because HMC cannot effectively explore one or more parts of the distribution.
Don’t calculate, simulate!
# X ~ N(0,1)
X <- rnorm(1e5,0,1) # n,mu,sd
Z <- X*4.5 + 10
W <- X^2
print(paste0("Z: ", round(mean(Z),2), ", W: ", round(mean(W),2)))
[1] "Z: 10, W: 1"
Family: gaussian
Links: mu = identity; sigma = identity
Formula: bill_length_mm ~ bill_depth_mm
Data: penguins (Number of observations: 342)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 55.15 2.51 50.29 60.02 1.00 4320 3206
bill_depth_mm -0.65 0.15 -0.93 -0.37 1.00 4235 3167
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 5.33 0.21 4.94 5.76 1.00 4741 2891
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
[1] "0.14575 proportion of posterior samples for b_bill_depth_mm are >= 0.05"
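For reference, a call like the following is a hypothetical reconstruction of what produced the summary and proportion above (it assumes the `brms` and `palmerpenguins` packages; the exact original call is not shown in the notes):

```r
library(brms)            # assumed packages: brms, palmerpenguins
library(palmerpenguins)

fit <- brm(bill_length_mm ~ bill_depth_mm, data = penguins,
           chains = 4, iter = 2000, warmup = 1000)
summary(fit)

draws <- as_draws_df(fit)                 # posterior draws as a data frame
mean(draws$b_bill_depth_mm >= 0.05)       # proportion of draws >= 0.05
```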